Tentative support for avx512vl extensions to 256 bit registers#1345
Tentative support for avx512vl extensions to 256 bit registers#1345serge-sans-paille wants to merge 4 commits into
Conversation
|
Are we sure that we only need avx512f for this? It seems to me that instructions like https://diamondinoia.com/simdref/#_mm256_cmp_epi32_mask requires avx512f + VL. Let me know where I am wrong. Cheers, |
|
You're right, all of this requires avx512f+avx512vl. It turns out most build have both (see https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512) but we currently don't have anything to model avx512vl, which should be the parent of avx512f_256. |
|
I guess it might be called avx512vl_256 at that point as we will also have avx512vl_128. What do you think? I agree, most CPU have 512+extensions. |
It looks like avx512vl does not have any 512bit instruction :-) but it still has avx512 in its name and it implies avx512f, so I agree with you. |
|
Actually, my suggestion might be wrong: https://diamondinoia.com/simdref/#_mm_maskz_andnot_pd There are instructions that require DQ + VL for example. |
|
based on the graph above, looks like DQ => VL, so we should still be fine? |
|
It seems like it, also basically all architectures listed in the chart have DQ,VL,BW together. So in any case is should be fine. |
|
@DiamonDinoia I've move the minimal creation of avx512vl to #1350 , once merged I'll rebase this PR |
|
Sure! Ping me when you need a review. It might be spotty when I'm on holiday |
avx512vl just extends 128 and 256 bits register with some operations, it does not have any 512 bit instructions, so the description is mostly empty and preliminary work for #1345
4b5ae7d to
9d15e04
Compare
9d15e04 to
ecdac94
Compare
ping ;-) |
There was a problem hiding this comment.
Just some suggestions and maybe we could add to the tests:
TEST_CASE_TEMPLATE_DEFINE("batch_bool mask hygiene", B, batch_bool_hygiene_id)
{
using value_type = typename B::value_type;
using batch_type = xsimd::batch<value_type, typename B::arch_type>;
SUBCASE("any(a != a) is false")
{
batch_type a(value_type(1));
CHECK_FALSE(xsimd::any(a != a));
}
SUBCASE("any(~(a == a)) is false")
{
batch_type a(value_type(1));
CHECK_FALSE(xsimd::any(~(a == a)));
}
SUBCASE("eq(false_mask, false_mask) is all-true")
{
auto m0 = (batch_type(value_type(1)) != batch_type(value_type(1)));
CHECK_UNARY(xsimd::all(m0 == m0));
}
SUBCASE("from_mask ignores bits above lane count")
{
constexpr std::size_t N = B::size;
uint64_t valid_all_true = (N == 64) ? ~uint64_t(0)
: ((uint64_t(1) << N) - 1);
uint64_t junk = valid_all_true | (uint64_t(1) << 63);
B clean = B::from_mask(valid_all_true);
B dirty = B::from_mask(junk);
CHECK_EQ(clean.mask(), dirty.mask());
CHECK_UNARY(xsimd::all(clean == dirty));
}
SUBCASE("batch_bool stored to bool[] is canonical 0/1")
{
batch_type a(value_type(1)), b(value_type(2));
auto m = (a == b); // all false
alignas(64) bool buf[B::size + 1] = {true, true}; // sentinel
xsimd::store_aligned(buf, m);
for (std::size_t i = 0; i < B::size; ++i)
{
// bit-level check: must be exactly 0, not just falsy
CHECK_EQ(*reinterpret_cast<uint8_t const*>(&buf[i]), uint8_t(0));
}
}
}
TEST_CASE_TEMPLATE_APPLY(batch_bool_hygiene_id, batch_bool_types);
```cpp
TEST_CASE_TEMPLATE_DEFINE("store_masked respects Mode", B, store_masked_mode_id)
{
using T = typename B::value_type;
using A = typename B::arch_type;
constexpr std::size_t N = B::size;
// Unaligned-mode + unaligned pointer: must not fault.
alignas(64) T big[2 * N + 1] = {};
T* unaligned_p = big + 1; // sizeof(T)-aligned only
struct AllTrue { static constexpr bool get(std::size_t, std::size_t) { return true; } };
auto cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
B v(T(7));
xsimd::kernel::store_masked(unaligned_p, v, cst,
xsimd::unaligned_mode{},
xsimd::kernel::requires_arch<A>{});
for (std::size_t i = 0; i < N; ++i)
CHECK_EQ(unaligned_p[i], T(7));
// Overload resolution: store_masked with same-typed mask must compile.
// If C3 regresses, this becomes a compile error.
auto signed_cst = xsimd::make_batch_bool_constant<T, AllTrue, A>();
alignas(64) T aligned_buf[N] = {};
xsimd::kernel::store_masked(aligned_buf, v, signed_cst,
xsimd::aligned_mode{},
xsimd::kernel::requires_arch<A>{});
for (std::size_t i = 0; i < N; ++i)
CHECK_EQ(aligned_buf[i], T(7));
}
TEST_CASE_TEMPLATE_APPLY(store_masked_mode_id, batch_types_for_masked_store);| template <class A, class T, class = std::enable_if_t<std::is_integral<T>::value>> | ||
| XSIMD_INLINE batch_bool<T, A> neq(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx512vl_256>) noexcept | ||
| { | ||
| return ~(self == other); |
There was a problem hiding this comment.
I think this flips junk into bits above batch_bool::size.
We might need to:
constexpr auto active_mask = register_type(register_type(-1) >> (sizeof(register_type) * 4));
return ~(self == other) & active_mask;
There was a problem hiding this comment.
yes for the others, but for this one everything is done at batch level so we're good (provided operator== is correct)
There was a problem hiding this comment.
Actually as we always ignore the upper bits, I think we're good
There was a problem hiding this comment.
I was thinking about if there are "horizontal" operations like and, or and so on. I'm not sure there is a vl variant that ignores these bits when using them. Otherwise when using the full avx512 variant we have to 0 those bits manually.
| XSIMD_INLINE batch_bool<T, A> eq(batch_bool<T, A> const& self, batch_bool<T, A> const& other, requires_arch<avx512vl_256>) noexcept | ||
| { | ||
| using register_type = typename batch_bool<T, A>::register_type; | ||
| return register_type(~self.data ^ other.data); |
There was a problem hiding this comment.
what's the issue with upper bits being garbage?
There was a problem hiding this comment.
I've changed uint64_t mask(batch_bool<T, A> const& self, requires_arch<avx512vl_256>) noexcept to apply the mask, because that's where the garbage bits may leak.
ecdac94 to
7ce75ef
Compare
avx512vl just extends 128 and 256 bits register with some operations, it does not have any 512 bit instructions, so the description is mostly empty and preliminary work for #1345
e5816f7 to
085ad2f
Compare
|
All green! |
DiamonDinoia
left a comment
There was a problem hiding this comment.
Hi,
Can you check that the following code produces the expected output? Because on my machine I get that a == a is true and a != is also true in some cases for example.
Also are we handling the store/load alignment correctly? I seem to notice (unless overload resolution dispatches to some other types) that we always use unaligned for integer types and always aligned for float/double?
// clang++ -march=x64-64-v4 -I/home/marco/repos/xsimd/include -DXSIMD_DEFAULT_ARCH=xsimd::avx512vl_256 /tmp/xsimd_stray_bits.cpp -o /tmp/xsimd_stray_bits
#include <cstdio>
#include <cstdint>
#include "xsimd/xsimd.hpp"
#include <iostream>
#include <typeinfo>
#include <cxxabi.h>
#include <memory>
// Helper function to demangle typeid names
std::string demangle(const char* mangled) {
int status = 0;
std::unique_ptr<char, decltype(&free)> result(
abi::__cxa_demangle(mangled, nullptr, nullptr, &status),
&free
);
return (status == 0) ? result.get() : mangled;
}
static int fail_count = 0;
#define EXPECT(label, expr, expected) \
do { \
const bool got = (expr); \
const bool ok = (got == (expected)); \
std::printf(" %-58s expected %-5s got %-5s %s\n", \
(label), \
(expected) ? "true" : "false", \
got ? "true" : "false", \
ok ? "PASS" : "FAIL"); \
if (!ok) ++fail_count; \
} while (0)
template <class T>
int check() {
using A = xsimd::avx512vl_256;
using B = xsimd::batch<T, A>;
std::cout << "\n=== " << demangle(typeid(B).name())
<< " (arch=" << demangle(typeid(A).name())
<< ", lanes=" << B::size << ") ===\n";
B a(T(1));
std::cout << " a = " << a << "\n";
// a equals itself
EXPECT("a should equal itself [all(a == a)]",
xsimd::all(a == a), true);
EXPECT("a should not differ from itself [any(a != a)]",
xsimd::any(a != a), false);
// combine two logically empty masks
auto m1 = (a != a);
auto m2 = (a != a);
EXPECT("AND of two empty masks should be empty [any(m1 & m2)]",
xsimd::any(m1 & m2), false);
EXPECT("OR of two empty masks should be empty [any(m1 | m2)]",
xsimd::any(m1 | m2), false);
EXPECT("XOR of identical empty masks should be empty [any(m1 ^ m2)]",
xsimd::any(m1 ^ m2), false);
// complement of an empty mask is a full mask
auto m_not = ~m1;
EXPECT("complement of an empty mask should be full [all(~m1)]",
xsimd::all(m_not), true);
EXPECT("complement of an empty mask should be nonempty [any(~m1)]",
xsimd::any(m_not), true);
// a bool batch equals itself
EXPECT("a bool mask should equal itself [all(m_not == m_not)]",
xsimd::all(m_not == m_not), true);
return 0;
}
int main() {
int fails = 0;
fails += check<int8_t>();
fails += check<int16_t>();
fails += check<int32_t>();
fails += check<int64_t>();
fails += check<float>();
fails += check<double>();
std::printf("%d failures\n", fails);
return fails != 0 ? 1 : 0;
}
| template <class A, bool... Values, class Mode> | ||
| XSIMD_INLINE void store_masked(float* mem, batch<float, A> const& src, batch_bool_constant<uint32_t, A, Values...> mask, Mode, requires_arch<avx512vl_256>) noexcept | ||
| { | ||
| _mm256_mask_store_ps(mem, mask.mask(), src); |
There was a problem hiding this comment.
this requires alignment to 32 bytes so we have to specify the Mode
There was a problem hiding this comment.
I went the easy way and forced unaligned store here and below.
| template <class A, bool... Values, class Mode> | ||
| XSIMD_INLINE void store_masked(double* mem, batch<double, A> const& src, batch_bool_constant<uint64_t, A, Values...> mask, Mode, requires_arch<avx512vl_256>) noexcept | ||
| { | ||
| _mm256_mask_store_pd(mem, mask.mask(), src); |
There was a problem hiding this comment.
same here this requires alignment.
085ad2f to
d5c787c
Compare
|
bug found, the all / any implementation was incorrect, fixed and test case added (your example was strange, the |
In addition to missing instructions (e.g. bas on int64_t etc) this mostly changes the mask representation from vector register to scalar, thus the big diff.
-DXSIMD_DEFAULT_ARCH is not a cmake option but a preprocessor option
d5c787c to
4a797bb
Compare
In addition to missing instructions (e.g. bas on int64_t etc) this mostly changes the mask representation from vector register to scalar, thus the big diff.